The Graduate School SEMI-SUPERVISED CLUSTERING FOR HIGH-DIMENSIONAL AND SPARSE FEATURES
Authors
Abstract
Clustering is one of the most common data mining tasks and is used frequently for data organization and analysis in a variety of application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised: class labels are unknown a priori. In real application domains, however, some “weak” form of side information about the domain or the data sets is often available or derivable. In particular, information in the form of instance-level pairwise constraints is general and relatively easy to derive. The problem with traditional clustering techniques is that they cannot benefit from side information even when it is available. I study the problem of semi-supervised clustering, which aims to partition a set of unlabeled data items into coherent groups given a collection of constraints. Because semi-supervised clustering promises higher quality with little extra human effort, it is of great interest both in theory and in practice. Semi-supervised clustering shares a difficulty with a large number of other learning methods in the data mining literature: it loses its algorithmic effectiveness on high-dimensional data. I focus on data with high-dimensional sparse features and present a series of novel semi-supervised clustering approaches that are
Related resources
Semi-supervised Hierarchical Clustering Analysis for High Dimensional Data
In many data mining tasks, there is a large supply of unlabeled data but limited labeled data, since it is expensive to generate. Therefore, a number of semi-supervised clustering algorithms have been proposed, but few of them are specifically designed for high-dimensional data. High dimensionality is a difficult challenge for clustering analysis due to the inherent sparse distribution, and most of p...
Pairwise Constrained Clustering for Sparse and High Dimensional Feature Spaces
Clustering high dimensional data with sparse features is challenging because pairwise distances between data items are not informative in high dimensional space. To address this challenge, we propose two novel semi-supervised clustering methods that incorporate prior knowledge in the form of pairwise cluster membership constraints. In particular, we project high-dimensional data onto a much red...
Hyperspectral Image Classification Based on the Fusion of the Features Generated by Sparse Representation Methods, Linear and Non-linear Transformations
The ability to record the high-resolution spectral signature of the earth's surface is perhaps the most important feature of hyperspectral sensors. Classification of hyperspectral imagery, in turn, is one of the methods for extracting information from these remote sensing data sources. Despite the high potential of hyperspectral images in terms of information content, there...
Sparse Modeling of High-Dimensional Data for Learning and Vision
Sparse representations account for most or all of the information of a signal by a linear combination of a few elementary signals called atoms, and have increasingly become recognized as providing high performance for applications as diverse as noise reduction, compression, inpainting, compressive sensing, pattern classification, and blind source separation. In this dissertation, we learn the s...
Fused Feature Representation Discovery for High-Dimensional and Sparse Data
The automatic discovery of a meaningful low-dimensional feature representation from a given data set is a fundamental problem in machine learning. This paper focuses specifically on the development of feature representation discovery methods appropriate for high-dimensional and sparse data. We formulate our feature representation discovery problem as a variant of the semi-supervised learni...
Journal title:
Volume, issue:
Pages: -
Publication date: 2010